Artificial neural networks are universal function approximators: they can approximate any continuous function (on a compact domain) to arbitrary accuracy.
ANNs consist of nodes, each of which performs a computation on an input, and layers, which are collections of nodes that have access to the same inputs.
While there are many variations of neural networks, the most common is the multi-layer perceptron, or feed-forward neural network.
They can be applied to supervised learning (e.g. regression and classification), unsupervised learning, and reinforcement learning.
ANNs form the basis of deep learning models.
Source: Introduction to Statistical Learning
Structure of a feed-forward NN with one hidden layer:
A neural network takes a vector of \(p\) variables \(X=(X_1,X_2,\ldots,X_p)\) and builds a nonlinear function \(f(X)\) to predict an outcome \(Y.\)
The resulting NN model with one hidden layer can be summarized as
\[f(X)=\beta_0+\sum_{k=1}^K \beta_k A_k = \beta_0+\sum_{k=1}^K \beta_k h_k(X)= \beta_0+\sum_{k=1}^K \beta_k\, g\Big(w_{k0}+\sum^p_{j=1}w_{kj}X_j\Big)\] where all parameters \(\beta_0,\ldots,\beta_K, w_{10},\ldots,w_{Kp}\) are estimated from the data.
Common choices for the activation function \(g\) are the sigmoid
\[g(z)=\frac{e^z}{1+e^z}=\frac{1}{1+e^{-z}}\]
and the ReLU (rectified linear unit)
\[g(z)= \begin{cases} 0, & z < 0 \\ z, & \textrm{otherwise} \end{cases}\]
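The one-hidden-layer model above can be sketched in NumPy. The weights below are random placeholders, not fitted values; the function and variable names are illustrative.

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # g(z) = max(0, z)
    return np.maximum(0.0, z)

def forward(X, W, w0, beta, beta0, g=sigmoid):
    """One-hidden-layer feed-forward pass.

    X     : (n, p) inputs
    W     : (K, p) hidden-layer weights w_{kj}
    w0    : (K,)   hidden-layer biases  w_{k0}
    beta  : (K,)   output-layer weights
    beta0 : scalar output-layer bias
    """
    A = g(X @ W.T + w0)       # activations A_k = g(w_{k0} + sum_j w_{kj} X_j)
    return beta0 + A @ beta   # f(X) = beta_0 + sum_k beta_k A_k

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # n = 5 observations, p = 3 inputs
W = rng.normal(size=(4, 3))   # K = 4 hidden units
f = forward(X, W, np.zeros(4), rng.normal(size=4), 0.0)
print(f.shape)   # one prediction per observation: (5,)
```

Swapping `g=relu` into the call changes only the activation; the structure of the computation is identical.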
The use of nonlinear activation functions is critical for generating high-quality approximations.
Interpretation of a feed-forward one hidden-layer NN
Nonlinearity in \(g\) is essential: without it, \(f(X)\) would collapse into a linear model.
Estimates of \(\theta = (\beta_0, \ldots, \beta_K, w_{10}, \ldots, w_{Kp})\) are obtained by minimizing the penalized loss function
\[\sum_{i=1}^n (y_i-f(x_i))^2 + \lambda R(\beta,w)\]
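The penalized loss is straightforward to compute directly; a minimal sketch (names and the choice of which coefficients enter the penalty are illustrative):

```python
import numpy as np

def penalized_loss(y, y_hat, beta, W, lam, penalty="ridge"):
    """Sum of squared errors plus lambda * R(beta, w)."""
    rss = np.sum((y - y_hat) ** 2)
    if penalty == "ridge":
        # Ridge: sum of squared coefficients
        R = np.sum(beta ** 2) + np.sum(W ** 2)
    else:
        # LASSO: sum of absolute coefficients
        R = np.sum(np.abs(beta)) + np.sum(np.abs(W))
    return rss + lam * R

y = np.array([1.0, 2.0])
y_hat = np.array([1.0, 1.0])
beta = np.array([1.0])
W = np.array([[2.0]])
print(penalized_loss(y, y_hat, beta, W, lam=0.1))           # 1.0 + 0.1*5 = 1.5
print(penalized_loss(y, y_hat, beta, W, 0.1, "lasso"))      # 1.0 + 0.1*3 = 1.3
```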
Main ingredients:
Regularizers: the LASSO penalty \(R(\beta,w)=\sum_k|\beta_k|+\sum_{k,j}|w_{kj}|\) and the Ridge penalty \(R(\beta,w)=\sum_k\beta_k^2+\sum_{k,j}w_{kj}^2\)
Slow learning: the loss is minimized iteratively using stochastic gradient descent (SGD).
Let \(f(\theta)\) be the loss function to minimize.
Gradient Descent Algorithm
Initialize: take an initial guess for the parameters \(\theta^0\)
Loop: for \(i=1,\ldots,M\):
Compute the gradient: \(\nabla f(\theta^{i-1})\)
Update the parameters: \(\theta^i=\theta^{i-1}-\eta \nabla f(\theta^{i-1})\)
Continue until parameters converge to a minimum: if \(|f(\theta^i)-f(\theta^{i-1})|<\epsilon\) break
Return \(\theta^i\)
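The algorithm above can be sketched directly; the toy loss and parameter names are illustrative.

```python
import numpy as np

def gradient_descent(grad, theta0, f, eta=0.1, M=1000, eps=1e-8):
    """Plain gradient descent as outlined above.

    grad : function returning the gradient of the loss at theta
    f    : the loss function, used for the convergence check
    eta  : learning rate, M : max iterations, eps : convergence tolerance
    """
    theta = np.asarray(theta0, dtype=float)
    prev = f(theta)
    for _ in range(M):
        theta = theta - eta * grad(theta)       # theta^i = theta^{i-1} - eta * grad
        cur = f(theta)
        if abs(cur - prev) < eps:               # |f(theta^i) - f(theta^{i-1})| < eps
            break
        prev = cur
    return theta

# Toy example: minimize f(theta) = ||theta - 3||^2, whose minimum is theta = 3
f = lambda th: float(np.sum((th - 3.0) ** 2))
grad = lambda th: 2.0 * (th - 3.0)
theta_hat = gradient_descent(grad, [0.0], f)
print(theta_hat)   # close to [3.]
```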
Remarks: Gradient descent uses the entire dataset to compute the gradient. It can be slow for large datasets.
SGD computes gradients on a random sample of the data or batch (often a single observation); a full pass through all batches is called an epoch.
SGD scales well to large and even massive datasets because the gradient computation per step is reduced: it uses only a batch instead of the full dataset.
Using subsamples introduces stochasticity, as opposed to computing the full gradient (thus helping avoid local minima but making convergence noisier).
Stochastic Gradient Descent Algorithm
Initialize: take an initial guess for the parameters \(\theta^0\)
Loop: for \(i=1,\ldots,M\):
Random sampling: select a random sample of points (or one point)
Compute the gradient at the selected point(s): \(\nabla f(\theta^{i-1})\)
Update the parameters: \(\theta^i=\theta^{i-1}-\eta \nabla f(\theta^{i-1})\)
Continue until parameters converge to a minimum: if \(|f(\theta^i)-f(\theta^{i-1})|<\epsilon\) break
Return \(\theta^i\)
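Minibatch SGD can be sketched on a least-squares problem, where the batch gradient has a closed form. The linear model and all hyperparameters here are illustrative.

```python
import numpy as np

def sgd_linear(X, y, eta=0.05, epochs=50, batch_size=4, seed=0):
    """Minibatch SGD for least squares: loss = sum_i (y_i - x_i @ theta)^2."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(epochs):                    # one epoch = a full pass over all batches
        idx = rng.permutation(n)               # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]  # random batch of observations
            resid = X[b] @ theta - y[b]
            g = 2.0 * X[b].T @ resid / len(b)  # gradient computed on the batch only
            theta -= eta * g
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
true_theta = np.array([1.5, -2.0])
y = X @ true_theta                 # noise-free data, so SGD can recover theta exactly
theta_hat = sgd_linear(X, y)
print(theta_hat)   # close to [1.5, -2.0]
```

Setting `batch_size=1` gives the single-observation variant mentioned above; `batch_size=n` recovers full gradient descent.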
Optimization methods in neural networks offer regularization beyond just penalizing coefficient size. These include:
Dropout regularization is a common technique in which each neuron is randomly set to zero with a fixed probability (e.g., 0.1) during each training update.
Early stopping monitors out-of-sample prediction accuracy alongside the in-sample objective function, halting training once validation performance stops improving.
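The dropout idea above can be sketched as a random mask on the activations. This is the common "inverted dropout" variant, which rescales the surviving units so their expected value is unchanged; the function name and rate are illustrative.

```python
import numpy as np

def dropout(A, rate=0.1, training=True, rng=None):
    """Inverted dropout: zero each activation with probability `rate`
    during training; rescale survivors by 1/(1-rate) so the expected
    activation matches test time, where no units are dropped."""
    if not training:
        return A                          # at test time, use all units
    rng = rng or np.random.default_rng()
    mask = rng.random(A.shape) >= rate    # keep each unit with probability 1 - rate
    return A * mask / (1.0 - rate)

A = np.ones((2, 5))
out = dropout(A, rate=0.1, rng=np.random.default_rng(0))
print(out)   # entries are either 0 or 1/0.9
```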
Source: Causal inference with ML and AI
Modern neural networks typically have more than one hidden layer, and often many units per layer. In theory a single hidden layer with a large number of units has the ability to approximate most functions. However, the learning task of discovering a good solution is made much easier with multiple layers each of modest size.
Neural network training requires selecting many tuning parameters typically chosen using validation methods.
Python
Implementations
sklearn offers two classes for feed-forward neural networks: MLPRegressor and MLPClassifier
TensorFlow (open-source machine learning framework developed by Google) + Keras (user-friendly interface that runs on top of TensorFlow and simplifies complex tasks like model definition, training, and evaluation)
PyTorch (open-source deep learning framework developed by Facebook) + PyTorch Lightning (a higher-level interface similar in spirit to Keras)
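A minimal sklearn example tying the pieces together: one hidden layer of ReLU units, a Ridge-type penalty (`alpha`), and early stopping on a held-out validation fraction. The simulated data and hyperparameters are illustrative, not tuned.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Simulated nonlinear regression problem: y = sin(2x) + noise
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

mlp = MLPRegressor(
    hidden_layer_sizes=(25,),   # one hidden layer, K = 25 units
    activation="relu",          # the ReLU g(z) from above
    alpha=1e-4,                 # L2 (Ridge) penalty weight lambda
    early_stopping=True,        # hold out a validation fraction, stop when it stalls
    max_iter=2000,
    random_state=0,
).fit(X_tr, y_tr)

r2 = mlp.score(X_te, y_te)      # out-of-sample R^2
print(round(r2, 2))
```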